Reducing OCR Errors in Gothic-Script Documents
Abstract
To improve OCR quality for texts originally typeset in Gothic script, we have built an automated correction system that is highly specialized for the given text. Our approach draws on external dictionary resources as well as information derived from the text itself. The focus lies on testing and improving different methods for classifying words as correct or erroneous. Different techniques are also applied to find and rate correction candidates. In addition, we are working on a web application that enables users to read and edit the digitized text online.
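As a rough illustration of the two steps named in the abstract (classifying words as correct or erroneous, then finding and rating correction candidates), here is a minimal Python sketch. It is not the authors' system: the function names, the frequency threshold `min_count`, and the use of plain Levenshtein distance are all assumptions made for this example. It treats a word as suspect when it is missing from an external lexicon and rare in the text itself, and rates candidates by edit distance, breaking ties by how often they occur in the text.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_suspect(word, lexicon, text_counts, min_count=3):
    """Flag a word as a likely OCR error (hypothetical rule): it is neither
    in the external lexicon nor frequent in the text itself."""
    return word not in lexicon and text_counts.get(word, 0) < min_count


def rank_candidates(word, lexicon, text_counts, max_distance=2):
    """Return lexicon words within max_distance edits of `word`,
    closest first, ties broken by frequency in the text."""
    scored = []
    for cand in lexicon:
        d = edit_distance(word, cand)
        if 0 < d <= max_distance:
            scored.append((d, -text_counts.get(cand, 0), cand))
    return [cand for _, _, cand in sorted(scored)]


# Illustrative data only: "Hauf" stands in for a misread of "Haus",
# a confusion typical of long-s/f in Gothic typefaces.
tokens = ["das", "Haus", "ist", "alt", "das", "Hauf"]
lexicon = {"das", "Haus", "ist", "alt"}
counts = Counter(tokens)
suspects = [t for t in tokens if is_suspect(t, lexicon, counts)]
suggestions = {t: rank_candidates(t, lexicon, counts) for t in suspects}
```

With this toy input, only "Hauf" is flagged, and "Haus" is the single candidate within two edits. A real system would likely weight character confusions specific to the typeface rather than counting every edit equally.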
Similar Resources
Gabor Features Based Script Identification of Lines within a Bilingual/Trilingual Document
The OCR technology for Indian documents is in emerging stage and most of these Indian OCR systems can read the documents written in only a single script. As many commercial and official documents of different states of India are tri-lingual in nature, therefore identification of script and/ or language is one of the elementary tasks for multi-script document recognition. A script recognizer sim...
A Comparative Analysis of Classifiers Accuracies for Bilingual Printed Documents (Oriya-English)
Bilingual document recognition has been the subject of intensive research and our focus is on the recognition of an Oriya-English bilingual documents. In most of our official papers, school text books, it is observed that English words interspersed within the Indian languages. So there is need for an Optical Character Recognition (OCR) system which can recognize these bilingual documents and st...
Script Identification from Bilingual Gujarati-English Documents
In a multi-lingual country like India, in most of the official papers, school text books, magazines, it is observed that English words intersperse within the Indian regional languages. So a bilingual Optical Character Recognition (OCR) system is needed which can recognize these bilingual documents and store it for future use. In this paper authors present an OCR system developed for the script ...
A Multiple Feature based Novel Approach for Identification of Printed Indian Scripts at Word Level
In a country like India where different scripts are in use, automatic identification of printed script facilitates many important applications such as automatic transcription of multilingual documents and for the selection of script specific OCR in a multilingual environment. In this paper a novel method to identify the script type of the collection of documents printed in seven Indian language...
A Complete Machine printed Gurmukhi OCR System
Recognition of Indian language scripts is a challenging problem. Work for the development of complete OCR systems for Indian language scripts is still in infancy. Complete OCR systems have recently been developed for Devanagri and Bangla scripts. Research in the field of recognition of Gurmukhi script faces major problems mainly related to the unique characteristics of the script like connectiv...
Journal:
- ERCIM News
Volume 2011, Issue
Pages -
Publication date: 2011